The Performance of Work Stealing in Multiprogrammed Environments

Authors

  • Robert D. Blumofe
  • Dionisios Papadopoulos
Abstract

As small-scale, shared-memory multiprocessors make their way onto desktops, the high-performance parallel applications that run on these machines will have to live alongside other applications, such as editors and web browsers. Unfortunately, unless parallel applications are coscheduled [4] or subject to process control [2], they display poor performance in such multiprogrammed environments [2]. As an alternative to coscheduling or process control, we investigate the use of dynamic, user-level thread scheduling. In particular, we show that a non-blocking [3] implementation of the work-stealing thread-scheduling algorithm [1] achieves efficient performance even when the number of available processors grows and shrinks over time.

All of the experiments in this paper were run on a Sun Ultra Enterprise 5000 with eight 167-MHz UltraSPARC processors running an unmodified Solaris 2.5.1. The work-stealing thread scheduler studied in this paper has been implemented in a C++ threads library called Hood, and our applications are coded in C++ using Hood. Hood is implemented on top of the Solaris threads library.

Traditionally, multithreaded applications use static partitioning and display poor performance when run in a multiprogrammed environment. Such a program creates some number P of (lightweight) processes (also known as kernel threads), and each process performs a 1/P fraction of the total work. Let T1 denote the work of the computation, which we define as the execution time with P = 1 process running on 1 dedicated processor. With P > 1 processes running on P dedicated processors, if the overhead of creating and synchronizing these processes is small compared to the T1/P work per process, then the execution time will be TP = T1/P, thereby giving a speedup of T1/TP = P. In a multiprogrammed environment, however, we might find that the actual number PA of processors on which the program runs is smaller than the number P of processes. In this case, we can still hope to achieve an execution time of TP = T1/PA, thereby giving a speedup of T1/TP = PA. Unfortunately, as Figure 1(a) shows, this hope is not realized. The figure shows the measured speedup of several statically partitioned applications for different numbers P of processes.
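The static partitioning described above can be sketched as follows: the work is divided into P fixed blocks before execution begins, one per process. This is an illustrative example (the function name `partitioned_sum` and the use of `std::thread` in place of Solaris kernel threads are assumptions, not the paper's code); its weakness is exactly what the paper measures: if only PA < P processors are available, the statically assigned blocks cannot be rebalanced.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Static partitioning: each of P processes (here, threads) is assigned a
// fixed 1/P fraction of the total work before execution starts.
std::int64_t partitioned_sum(std::int64_t n, int p) {
    std::vector<std::int64_t> partial(p, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < p; ++i) {
        workers.emplace_back([&partial, i, n, p] {
            // Thread i sums its statically assigned block [lo, hi) of 1..n.
            std::int64_t lo = 1 + i * n / p;
            std::int64_t hi = 1 + (i + 1) * n / p;
            for (std::int64_t k = lo; k < hi; ++k) partial[i] += k;
        });
    }
    for (auto& w : workers) w.join();
    std::int64_t total = 0;
    for (auto s : partial) total += s;
    return total;
}
```

On P dedicated processors the blocks finish together in roughly T1/P time; on PA < P processors, some blocks are descheduled mid-run and the whole computation waits on the slowest one, which is why the measured speedups fall short of PA.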

Related articles

Dynamic Memory ABP Work-Stealing

The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load-balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based deques with low-cost synchronization among local and stealing processes. Unfortunately, the algorithm’s synchron...
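The deque interface behind this scheme can be sketched as follows: the owning worker pushes and pops work at the bottom of its deque, while thieves remove the oldest item from the top with a compare-and-swap. This is a simplified sketch, not the original ABP code: it uses sequentially consistent atomics and monotonically growing indices over a fixed circular buffer (in the style of the later Chase-Lev refinement, which avoids ABP's tagged `top`), and the class name `WSDeque` is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Array-based work-stealing deque (simplified sketch). Only the owner
// moves bottom_; thieves advance top_ with CAS. A production version
// would use relaxed/acquire-release orderings and atomic buffer slots.
class WSDeque {
    static constexpr std::size_t kCap = 1024;  // assumed fixed capacity
    int buf_[kCap];
    std::atomic<long> top_{0};     // thieves steal the oldest item here
    std::atomic<long> bottom_{0};  // owner pushes/pops here

public:
    // Owner pushes new work on the bottom: one store, no CAS.
    void push_bottom(int x) {
        long b = bottom_.load();
        buf_[b % kCap] = x;
        bottom_.store(b + 1);
    }

    // Owner pops from the bottom; a CAS is needed only when contending
    // with a thief for the last remaining element.
    std::optional<int> pop_bottom() {
        long b = bottom_.load() - 1;
        bottom_.store(b);
        long t = top_.load();
        if (t > b) {                  // deque was already empty
            bottom_.store(t);
            return std::nullopt;
        }
        int x = buf_[b % kCap];
        if (t != b) return x;         // more than one element: no race
        long t0 = t;                  // last element: race a thief for it
        std::optional<int> r;
        if (top_.compare_exchange_strong(t, t0 + 1)) r = x;
        bottom_.store(t0 + 1);
        return r;
    }

    // Thief steals the oldest item from the top with a CAS.
    std::optional<int> steal() {
        long t = top_.load();
        long b = bottom_.load();
        if (t >= b) return std::nullopt;  // empty
        int x = buf_[t % kCap];
        if (!top_.compare_exchange_strong(t, t + 1)) return std::nullopt;
        return x;
    }
};
```

The asymmetry is the point of the design: the owner's common-path push and pop are nearly free, while synchronization cost is pushed onto the rare steal path, and thieves taking from the opposite end grab the oldest (typically largest) pieces of work.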


Hood: A User-Level Threads Library for Multiprogrammed Multiprocessors

The Hood user-level threads library delivers efficient performance under multiprogramming without any need for kernel-level resource management, such as coscheduling or process control. It does so by scheduling threads with a non-blocking implementation of the work-stealing algorithm. With this implementation, the execution time of a program running with arbitrarily many processes on arbitraril...


Dynamic Processor Allocation for Adaptively Parallel Work-Stealing Jobs

This thesis addresses the problem of scheduling multiple, concurrent, adaptively parallel jobs on a multiprogrammed shared-memory multiprocessor. Adaptively parallel jobs are jobs for which the number of processors that can be used without waste varies during execution. We focus on the specific case of parallel jobs that are scheduled using a randomized work-stealing algorithm, as is used in th...


History-Based Adaptive Work Distribution

Exploiting parallelism of increasingly heterogeneous parallel architectures is challenging due to the complexity of parallelism management. To achieve high performance portability whilst preserving high productivity, high-level approaches to parallel programming delegate parallelism management, such as partitioning and work distribution, to the compiler and the run-time system. Random work stea...


Efficient Work Stealing for Portability of Nested Parallelism and Composability of Multithreaded Program

We present performance evaluations of a parallel-for loop implemented with the work-stealing technique. The parallel-for by work stealing transforms the loop into a binary tree using divide-and-conquer. Iterations are distributed among the leaf procedures of the binary tree, and parallel execution proceeds by stealing subtrees from the bottom of the tree. The work s...

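The divide-and-conquer transformation described above can be sketched as follows: the iteration range is split recursively into a binary tree, and leaves execute contiguous chunks of iterations. In a work-stealing runtime an idle worker would steal the untouched half of the range (a whole subtree) rather than single iterations; in this sketch `std::async` stands in for that scheduler, and the name `parallel_for` and the `grain` cutoff are illustrative assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>

// Divide-and-conquer parallel-for: split [lo, hi) into a binary tree of
// subranges; each leaf of at most `grain` iterations runs sequentially.
void parallel_for(int lo, int hi, int grain, const std::function<void(int)>& body) {
    if (hi - lo <= grain) {
        for (int i = lo; i < hi; ++i) body(i);  // leaf: run the chunk
        return;
    }
    int mid = lo + (hi - lo) / 2;
    // Offer the right subtree to another worker (here: a new task);
    // a work-stealing scheduler would instead leave it stealable.
    auto right = std::async(std::launch::async, parallel_for,
                            mid, hi, grain, std::cref(body));
    parallel_for(lo, mid, grain, body);  // recurse on the left half locally
    right.get();
}
```

Because thieves take subtrees near the root, a single steal transfers half of the remaining iterations, keeping steal frequency low even for large loops.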


Publication date: 1997